Abstract

By representing the range of fair betting odds according to a pair of confidence set estimators, 
dual probability measures on parameter space called frequentist posteriors secure the coherence 
of subjective inference without any prior distribution. The closure of the set of expected losses 
corresponding to the dual frequentist posteriors constrains decisions without arbitrarily forcing op- 
timization under all circumstances. This decision theory reduces to those that maximize expected 
utility when the pair of frequentist posteriors is induced by an exact or approximate confidence set 
estimator or when an automatic reduction rule is applied to the pair. In such cases, the resulting 
frequentist posterior is coherent in the sense that, as a probability distribution of the parameter 
of interest, it satisfies the axioms of the decision-theoretic and logic-theoretic systems typically 
cited in support of the Bayesian posterior. Unlike the p-value, the confidence level of an interval
hypothesis derived from such a measure is suitable as an estimator of the indicator of hypothesis
truth since it converges in sample-space probability to 1 if the hypothesis is true or to 0 otherwise
under general conditions.

Keywords: attained confidence level; coherence; coherent prevision; confidence distribution; decision 
theory; minimum expected loss; fiducial inference; foundations of statistics; imprecise probability; 
maximum utility; observed confidence level; problem of regions; significance testing; upper and lower 
probability; utility maximization 






1 Introduction 



1.1 Motivation 



A well known mistake in the interpretation of an observed confidence interval confuses confidence as
a level of certainty with "confidence" as the coverage rate, the almost-sure limiting rate at which a
confidence interval would cover a parameter value over repeated sampling from the same population.
This results in using the stated confidence level, say 95%, as if it were a probability that the parameter
value lies in the particular confidence interval that corresponds to the observed sample. A practical
solution that does not sacrifice the 95% coverage rate is to report a confidence interval that matches a
95% credibility interval computable from Bayes's formula given some matching prior distribution
(Rubin 1984). In addition to canceling the error in interpretation, such matching enables the statistician
to leverage the flexibility of the Bayesian approach in making jointly consistent inferences, involving,
for example, the probability that the parameter lies in any given region of the parameter space, on
the basis of a posterior distribution firmly anchored to valid frequentist coverage rates. Priors yielding
exact matching of predictive probabilities are available for many models, including location models and
certain location-scale models (Datta et al. 2000; Severini et al. 2002). Although exact matching of
fixed-parameter coverage rates is limited to location models (Welch and Peers 1963; Fraser and Reid
2002), priors yielding asymptotic matching have been identified for other models, e.g., a hierarchical
normal model (Datta et al. 2000). For mixture models, all priors that achieve matching to second
order necessarily depend on the data but asymptotically converge to fixed priors (Wasserman 2000).
Data-based priors can also yield second-order matching with insensitivity to the sampling distribution
(Sweeting 2001). Agreeably, Fraser (2008b) suggested a data-dependent prior for approximating
the likelihood function integrated over the nuisance parameters to attain accurate matching between
Bayesian probabilities and coverage rates. These advances approach the vision of building an objective
Bayesianism, defined as a "universal recipe for applying Bayes theorem in the absence of prior
information" (Efron 1998).



Viewed from another angle, the fact that close matching can require resorting to priors that change 
with each new observation, cracking the foundations of Bayesian inference, raises the question of 
whether many of the goals motivating the search for an objective posterior can be achieved apart from 
Bayes's formula. It will in fact be seen that such a probability distribution lies dormant in nested 
confidence intervals, securing the above benefits of interpretation and coherence without matching 
priors, provided that the confidence intervals are constructed to yield reasonable inferences about the 
value of the parameter for each sample from the available information. 

Unless the confidence intervals are conservative by construction, the condition of adequately incor- 
porating any relevant information is usually satisfied in practice since confidence intervals are most 
appropriate when information about the parameter value is either largely absent or included in the in- 
terval estimation procedure, as it is in random-effects modeling and various other frequentist shrinkage 
methods. Likewise, confidence intervals known to lead to pathologies tend to be avoided. (Pathological
confidence intervals often emphasized in support of credibility intervals include formally valid
confidence intervals that lie outside the appropriate parameter space (Mandelkern 2002) and those
that can fail to ascribe 100% confidence to an interval deduced from the data to contain the true
value (Bernardo and Smith 1994).) A game-theoretic framework makes the requirement more precise:



for the 95% confidence interval to give a 95% degree of certainty in the single case and to support 
coherent inferences, it must be generated to ensure that, on the available information, 19:1 are approx- 
imately fair betting odds that the parameter lies in the observed interval. This condition rules out the 
use of highly conservative intervals, pathological intervals, and intervals that fail to reflect substantial
pertinent information. In relying on an observed confidence interval to that extent, the decision
maker ignores the presence of any recognizable subsets (Gleser 2002), not only slightly conservative
subsets, as in the tradition of controlling the rate of Type I errors (Casella 1987), but also slightly
anti-conservative subsets. Given the ubiquity of recognizable subsets (Buehler and Feddersen 1963;
Bondar 1977), this strategy uses pre-data confidence as an approximation to post-data confidence
in the sense in which expected Fisher information approximates observed Fisher information (Efron
and Hinkley 1978), aiming not at exact inference but at a pragmatic use of the limited resources
available for any particular data analysis. Certain situations may instead call for careful applications
of conditional inference (Goutis and Casella 1995; Sundberg 2003; Fraser 2004) for basing decisions
more directly on the data actually observed.



1.2 Direct inference and attained confidence 

The above betting interpretation of a frequentist posterior will be generalized in a framework of 
decision to formalize, control, and extend the common practice of equating the level of certainty 
that a parameter lies in an observed confidence interval with the interval estimator's rate of coverage 
over repeated sampling. 

Many who fully understand that the 95% confidence interval is defined to achieve a 95% coverage 
rate over repeated sampling will for that reason often be substantially more certain that the true 
value of the parameter lies in an observed 99% confidence interval than that it lies in a 50% confidence 
interval computed from the same data (Franklin 2001; Pawitan 2001, pp. 11-12). This direct inference,
reasoning from the frequency of individuals of a population that have a certain property to a level of
certainty about whether a particular sample from the population has that property, is a notable
feature of inductive logic (e.g., Franklin 2001; Jaeger 2005) and often proves effective in everyday
decisions. Knowing that the



new cars of a certain model and year have speedometer readings within 1 mile per hour (mph) of the 
actual speed in 99.5% of cases, most drivers will, when betting on whether they comply with speed 
limits, have a high level of certainty that the speedometer readings of their particular new cars of that 
model and year accurately report their current speed in the absence of other relevant information. 
(Such information might include a reading of 10 mph when the car is stationary, which would indicate 
a defect in the instrument at hand.) If the above betting interpretation of the confidence level holds 
for an interval given by some predetermined level of confidence, then coherence requires that it hold 
equally for a level of confidence given by some predetermined hypothesis. 

Fisher's fiducial argument also employed direct inference (Fisher 1945; Fisher 1973, pp. 34-36,
57-58; Hacking 1965, Chapter 9; Zabell 1992). The present framework departs from his in its
applicability to inexact confidence sets, in the closer proximity of its probabilities to repeated-sampling
rates of covering vector parameters, in its toleration of reference classes with relevant subsets, and in its
theory of decision. Since the second and third departures are shared with recent methods of computing
the confidence probability of an arbitrary hypothesis (§3.2.2), the main contribution of this paper is
the general framework of inference that both motivates such methods given an exact confidence set
and extends them for use with approximate, valid, and nonconservative set estimators and for coherent
decision making, including prediction and point estimation.

This framework draws from the theory of coherent upper and lower probabilities for the cases in
which no exact confidence set with the desired properties is available. To allow indecision in light of
inconclusive evidence, these non-additive probabilities have been formulated for lotteries in which the
agent may either place a bet or refrain from betting or, equivalently, in which the casino posts different
odds to be used depending on whether a gambler bets for or against a hypothesis. Confidence decision
theory will be formulated for this scenario by setting an agent's prices of buying and selling a gamble
on the hypothesis that a parameter $\theta$ is in some set $\Theta' \subseteq \Theta$ according to the confidence levels of a
valid set estimate and a nonconservative confidence set estimate that coincide with $\Theta'$. As a result,
the hypothesis has an interval of confidence levels rather than a single confidence level. Equating the
buying and selling prices reduces the upper and lower probability functions to a single frequentist
posterior, a probability measure on parameter space $\Theta$, and thus reduces the interval to a point.



1.3 Overview 

This subsection outlines the organization of the remainder of the paper while offering a brief summary. 






After preliminary concepts are defined (§2.1), Section 2.2 presents the new framework for confidence-based
inference and decision. The family of probability measures (frequentist posteriors) used in inference
and decision can be stated in terms of coherent lower and upper probabilities and is thus
completely self-consistent according to a widely accepted account of coherence derived from ideas of
Bruno de Finetti (§2.3). This lays a foundation for decisions and for flexible inference about the truth
of hypotheses without invoking the likelihood principle (§§2.4-2.5). The framework is compared to
other versions of frequentist coherence based on upper and lower probabilities in Section 2.5.

While reporting an interval level of confidence in a hypothesis has the advantage of honestly
communicating the insufficiency of the data to determine a single confidence level, such intervals are less
useful in situations requiring the automation of decisions. Under such circumstances, the family of
frequentist posteriors can be reduced to a single frequentist posterior by the use of exact or approximate
confidence sets or by an automatic reduction rule (§3.1). For a single frequentist posterior, confidence
decision theory is equivalent to the minimization of expected posterior loss (§3.2). As a probability
measure on hypothesis space, the resulting frequentist posterior satisfies the same coherence axioms
as the Bayesian posterior whether or not it is compatible with any prior distribution (§3.3). The
important special case of a scalar parameter of interest provides an arena for contrasting frequentist
posterior probabilities and p-values (§3.4).

The confidence framework provides direct and simple approaches to common problems of data
analysis, as will be illustrated by example in Sections 3.2 and 3.4.3. Examples include reporting
probabilistic levels of confidence of the interval, two-sided null hypotheses required in bioequivalence
testing, assigning confidence to a complex region, and assessing practical or scientific significance.
Posterior point estimates and predictions that account for parameter uncertainty are also available
without relinquishing the objectivity of the Neyman-Pearson framework.

Section 4 concludes the paper by highlighting the main properties of the proposed framework.



2 Confidence decision theory 
2.1 Preliminaries 

2.1.1 Basic notation 

The values of $x \wedge y$ and $x \vee y$ are respectively the minimum and maximum of $x$ and $y$. The symbols $\subseteq$
and $\subset$ respectively signify subset and proper subset. $1_{\Theta'} : \Theta \to \{0, 1\}$ is the usual indicator function:
$1_{\Theta'}(\theta)$ is 1 if $\theta \in \Theta'$ or 0 if $\theta \notin \Theta'$.

Angular brackets rather than parentheses signal numeric tuples. For example, if $x$ and $y$ are
numbers, then $\langle x, y \rangle$ denotes an ordered pair, whereas $(x, y)$ denotes the open interval $\{z : x < z < y\}$.

Given a probability space $(\Omega, \Sigma, P_\xi)$ indexed by the vector parameter $\xi \in \Xi \subseteq \mathbb{R}^d$, consider the
random quantity $X$ of distribution $P_\xi$ and with a realization $x$ in some sample set $\Omega \subseteq \mathbb{R}^n$. Without
loss of generality, partition the full parameter $\xi$ into an interest parameter $\theta \in \Theta$ and, unless $\theta = \xi$, a
nuisance parameter $\gamma \in \Gamma$, such that $\xi \in \Theta \times \Gamma$ and $P_{\theta,\gamma} = P_\xi$.

Except where otherwise noted, every probability distribution is a standard (Kolmogorov) probabil- 
ity measure. An incomplete probability measure is a standard, additive measure with total mass less 
than or equal to 1. 

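Let $(\Theta, \mathcal{A})$ represent a measurable space and $\mathcal{B}([0, 1])$ the Borel $\sigma$-field of $[0,1]$. The complement
and power set of $\Theta'$ are $\bar{\Theta}'$ and $2^{\Theta'}$, respectively. The $\sigma$-field induced by $\mathcal{C}$ is $\sigma(\mathcal{C})$.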

2.1.2 Metameasure and metaprobability spaces 

The following slight extension of probability theory facilitates a clear and precise presentation of the
present framework. To avoid unnecessary confusion between single-valued probability and the specific 
type of multi-valued probability required, the former will be called "probability" in agreement with 
common usage, and the latter will be called "metaprobability," a term defined below. 
Definition 1. Given a measurable space $(\Theta, \mathcal{A})$ and a metameasure space, the triple $\mathcal{M} = (\Theta, \mathcal{A}, \mathfrak{P})$
with a family $\mathfrak{P}$ of measures, the metameasure $\mathcal{P}$ of $\mathcal{M}$ is a function $\mathcal{P}$ from $\mathcal{A}$ to the set of all closed
intervals of $[0, \infty)$ such that $\mathcal{P}(A)$ is the closure of

$$\{P(A) : P \in \mathfrak{P}\}$$

for each $A \in \mathcal{A}$. The metameasure $\mathcal{P}$ is said to be degenerate if $|\mathfrak{P}| = 1$ or nondegenerate if $|\mathfrak{P}| > 1$.

Definition 2. The metameasure $\mathcal{P}$ of a metameasure space $\mathcal{M} = (\Theta, \mathcal{A}, \mathfrak{P})$ is a probability metameasure
if each member of $\mathfrak{P}$ is a probability measure. Then $\mathcal{M}$ is called a metaprobability space, and
$\mathcal{P}(A)$ is called the metaprobability of event $A$ for all $A \in \mathcal{A}$. The expectation interval or expected
interval $\mathcal{E}(L)$ of a measurable map $L : \Theta \to \mathbb{R}^1$ with respect to a probability metameasure $\mathcal{P}$ on $\mathcal{M}$
is the closure of

$$\left\{ \int L(\theta)\, dP(\theta) : P \in \mathfrak{P} \right\}.$$

In words, the expectation interval of a random quantity with respect to a probability metameasure 
is the smallest closed interval containing the expectation values of the random quantity with respect 
to the probability measures of the metaprobability space. 
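
The verbal description above can be sketched for a finite parameter space. The following is a minimal illustration, not part of the original framework: the three-point space, the two measures `P1` and `P2`, and the loss values are hypothetical, and each interval is taken to be the smallest closed interval containing the values given by the members of the family.

```python
from typing import Dict, Iterable, List, Tuple

# A measure on a finite parameter space: point -> mass (hypothetical representation).
Measure = Dict[str, float]


def metaprobability(family: List[Measure], event: Iterable[str]) -> Tuple[float, float]:
    """Smallest closed interval containing P(event) for every P in the family."""
    event = set(event)
    values = [sum(mass for point, mass in P.items() if point in event) for P in family]
    return min(values), max(values)


def expectation_interval(family: List[Measure], loss: Dict[str, float]) -> Tuple[float, float]:
    """Smallest closed interval containing the expected loss under every P in the family."""
    values = [sum(loss[point] * mass for point, mass in P.items()) for P in family]
    return min(values), max(values)


# Hypothetical two-member family of probability measures.
P1 = {"low": 0.2, "mid": 0.5, "high": 0.3}
P2 = {"low": 0.1, "mid": 0.6, "high": 0.3}
family = [P1, P2]

print(metaprobability(family, {"low"}))
print(expectation_interval(family, {"low": 0.0, "mid": 1.0, "high": 4.0}))
```

A family with a single member gives intervals of zero width, which corresponds to the degenerate case of Definition 1.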

2.2 Confidence measures and metameasures 

Particular types of confidence sets form the basis of the metameasure on which confidence decision 
theory rests. 

Definition 3. A set estimator $\hat\Theta$ for $\theta$ is a function defined on $\Omega \times [0, 1]$. A set estimator is called valid
if its coverage rate over repeated sampling is at least as great as $\rho$, the nominal confidence coefficient:

$$P_\xi\left(\theta \in \hat\Theta(X;\rho)\right) \ge \rho$$

for all $\xi \in \Xi$ and $\rho \in [0, 1]$. A set estimator is called nonconservative if its coverage rate over repeated
sampling is no greater than the nominal confidence coefficient:

$$P_\xi\left(\theta \in \hat\Theta(X;\rho)\right) \le \rho$$

for all $\xi \in \Xi$ and $\rho \in [0, 1]$. A set estimator that is both valid and nonconservative is called exact. For
some set $\mathcal{C}$ of connected subsets of $\Theta$, a set estimator is called nested if it is a function $\hat\Theta : \Omega \times [0, 1] \to \mathcal{C}$
such that, for all $x \in \Omega$, there is a $\mathcal{C}(x) \subseteq \mathcal{C}$ such that $\hat\Theta(x; \bullet) : [0, 1] \to \mathcal{C}(x)$ is bijective,
$\hat\Theta(x; 0) = \emptyset$, $\hat\Theta(x; 1) = \Theta$, and

$$\hat\Theta(x; \rho_1) \subset \hat\Theta(x; \rho_2) \tag{1}$$

for all $0 \le \rho_1 < \rho_2 \le 1$. Two nested set estimators $\hat\Theta_1 : \Omega \times [0, 1] \to \mathcal{C}$ and $\hat\Theta_2 : \Omega \times [0, 1] \to \mathcal{C}$ are
dual if the ranges $\mathcal{C}_1(x)$ and $\mathcal{C}_2(x)$ of $\hat\Theta_1(x; \bullet)$ and $\hat\Theta_2(x; \bullet)$ induce the same $\sigma$-field, i.e., $\sigma(\mathcal{C}_1(x)) =
\sigma(\mathcal{C}_2(x))$, for each $x \in \Omega$.
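
As an illustration of the distinction between valid and nonconservative set estimators, the sketch below computes exact coverage rates for one-sided upper confidence bounds under a binomial model. Everything in it is an assumption made for illustration: the binomial model, the tail probability $P_\theta(X > x) + \zeta P_\theta(X = x)$ with a correction weight `zeta`, and the characterization of the coverage event as that tail probability being at most the nominal level `rho`.

```python
import math


def binom_pmf(k: int, n: int, theta: float) -> float:
    """Binomial probability of k successes in n trials."""
    return math.comb(n, k) * theta ** k * (1 - theta) ** (n - k)


def corrected_upper_tail(x: int, n: int, theta: float, zeta: float) -> float:
    """P(X > x) + zeta * P(X = x); increasing in theta for fixed x."""
    above = sum(binom_pmf(k, n, theta) for k in range(x + 1, n + 1))
    return above + zeta * binom_pmf(x, n, theta)


def coverage_rate(n: int, theta: float, rho: float, zeta: float) -> float:
    """Exact probability that theta lies at or below the level-rho upper bound.

    Assumption: the bound solves corrected_upper_tail(x, n, bound, zeta) = rho,
    so theta is covered exactly when the tail probability at theta is <= rho.
    """
    return sum(
        binom_pmf(x, n, theta)
        for x in range(n + 1)
        if corrected_upper_tail(x, n, theta, zeta) <= rho
    )


for theta in (0.1, 0.3, 0.5, 0.7, 0.9):
    at_least = coverage_rate(12, theta, 0.9, zeta=0.0)  # coverage >= rho
    at_most = coverage_rate(12, theta, 0.9, zeta=1.0)   # coverage <= rho
    print(theta, round(at_least, 4), round(at_most, 4))
```

With `zeta=0.0` the coverage never falls below `rho` (a valid estimator), and with `zeta=1.0` it never exceeds `rho` (a nonconservative estimator); for a discrete model neither is exact.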

The desired metameasure will be constructed from two confidence measures in turn constructed 
from dual nested set estimators. 

Definition 4. Let $\hat\Theta : \Omega \times [0, 1] \to \mathcal{C}$ denote a nested set estimator and $\mathcal{A}^x$ the $\sigma$-field induced by $\mathcal{C}(x)$,
the range of $\hat\Theta(x; \bullet)$ for each $x \in \Omega$. Then, for all $x \in \Omega$, $\hat\Theta$ induces the probability space $(\Theta, \mathcal{A}^x, P^x)$
and the confidence measure or frequentist posterior $P^x$, the probability measure on $\mathcal{A}^x$ such that

$$\Theta' \in \mathcal{C}(x) \implies \Theta' = \hat\Theta\left(x; P^x(\Theta')\right). \tag{2}$$

The probability $P^x(\Theta')$ is the confidence level of the hypothesis that $\theta \in \Theta'$. If $\hat\Theta$ is valid, nonconservative,
or exact, then $P^x$ and $P^x(\Theta')$ are likewise called valid, nonconservative, or exact, respectively.
The next result provides the confidence level of any hypothesis that $\theta \in \Theta' \in \mathcal{A}^x$ as the sum of
confidence levels given more directly by equation (2).

Proposition 5. For each $x \in \Omega$, let $(\Theta, \mathcal{A}^x, P^x)$ be the confidence measure induced by the nested set
estimator $\hat\Theta : \Omega \times [0, 1] \to \mathcal{C}$, and let $\mathcal{C}(x)$ be the range of $\hat\Theta(x; \bullet)$. For some $K \in \{1, 2, \dots\}$, let
$\Theta' = \cup_{k=1}^{K} \Theta'_k$, where $\Theta'_k \in \mathcal{A}^x$ and

$$i \ne j \implies \Theta'_i \cap \Theta'_j = \emptyset.$$

Then

$$P^x(\Theta') = \sum_{k=1}^{K} \left( P^x(\Theta_k^+) - P^x(\Theta_k^-) \right), \tag{3}$$

where

$$\Theta_k^+ = \arg \inf_{\Theta'' \in \mathcal{C}(x),\ \Theta'_k \subseteq \Theta''} |\Theta''|$$

and $\Theta_k^- = \Theta_k^+ \setminus \Theta'_k$ for all $k \in \{1, 2, \dots, K\}$.

Proof. $P^x(\Theta_k^+) = P^x(\Theta'_k) + P^x(\Theta_k^-)$ and $P^x(\cup_{k=1}^{K} \Theta'_k) = \sum_{k=1}^{K} P^x(\Theta'_k)$ follow from the mutual
exclusivity of the sets and from the additivity of the measure $P^x$. □

Thus, since, for all $k \in \{1, 2, \dots, K\}$, both $\Theta_k^+$ and $\Theta_k^-$ are in $\mathcal{C}(x)$ and since $\mathcal{C}(x)$ induces $\mathcal{A}^x$,
equations (2) and (3) can be used to calculate $P^x(\Theta')$ for any $\Theta' \in \mathcal{A}^x$.
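
The sum in Proposition 5 can be illustrated with nested one-sided sets. The sketch below is an assumption-laden example rather than the paper's own computation: it supposes that the nested sets are the half-lines $(-\infty, b]$, that their confidence levels are given by a hypothetical confidence distribution function (here a standard normal CDF), and that the hypothesis is a union of disjoint half-open intervals $(a_k, b_k]$.

```python
import math
from typing import List, Tuple


def confidence_cdf(theta: float) -> float:
    """Hypothetical confidence level of the nested set (-inf, theta]."""
    return 0.5 * (1.0 + math.erf(theta / math.sqrt(2.0)))


def confidence_of_union(intervals: List[Tuple[float, float]]) -> float:
    """Confidence level of a union of disjoint intervals (a_k, b_k].

    For each piece, the smallest nested set containing it is (-inf, b_k],
    and removing the piece from that set leaves (-inf, a_k].
    """
    total = 0.0
    for a, b in intervals:
        level_containing_set = confidence_cdf(b)  # level of (-inf, b_k]
        level_remainder = confidence_cdf(a)       # level of (-inf, a_k]
        total += level_containing_set - level_remainder
    return total


# Two adjacent pieces give the same level as the single interval they form.
print(confidence_of_union([(-1.0, 0.0), (0.0, 1.0)]))
print(confidence_of_union([(-1.0, 1.0)]))
```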

Definition 6. Consider the dual nested set estimators $\hat\Theta_\ge : \Omega \times [0, 1] \to \mathcal{C}$, which is valid, and
$\hat\Theta_\le : \Omega \times [0, 1] \to \mathcal{C}$, which is nonconservative. For every $x \in \Omega$, let $\mathcal{A}^x$ denote the common $\sigma$-field
induced by each of the ranges of $\hat\Theta_\ge(x; \bullet)$ and $\hat\Theta_\le(x; \bullet)$. If $P^x_\ge$ is the valid confidence measure, the
confidence measure induced by $\hat\Theta_\ge$, then $P^x_\ge(\Theta')$ is called a valid confidence level of the hypothesis
that $\theta \in \Theta'$. For each $x \in \Omega$, the dual nonconservative confidence measure $P^x_\le$ and nonconservative
confidence level $P^x_\le(\Theta')$ are defined analogously. On the metaprobability space

$$\mathcal{M}^x_{\ge,\le} = \left(\Theta, \mathcal{A}^x, \{P^x_\ge, P^x_\le\}\right),$$

called a confidence metameasure space, the probability metameasure $\mathcal{P}^x$ is called the confidence
metameasure induced by $\hat\Theta_\ge$ and $\hat\Theta_\le$ given some $x$ in $\Omega$. Accordingly, the confidence metalevel of
the hypothesis that $\theta \in \Theta'$ is $\mathcal{P}^x(\Theta')$ for all $\Theta' \in \mathcal{A}^x$. By the definition of metaprobability, any
hypothesis $\Theta' \in \mathcal{A}^x$ has a confidence metalevel of

$$\mathcal{P}^x(\Theta') = \left[ P^x_\ge(\Theta') \wedge P^x_\le(\Theta'),\ P^x_\ge(\Theta') \vee P^x_\le(\Theta') \right]. \tag{4}$$



Remark 7. The restriction to $\sigma$-fields with events common to valid and nonconservative confidence
measures strongly constrains the choice of the estimators to ensure the ability to assign a confidence
metalevel to any hypothesis of interest without a need for incomplete probability measures. The
further flexibility of allowing multiple $\sigma$-fields in a class of measure spaces may be desirable in some
applications.

Strategies developed within more conventional frequentist frameworks provide guidance on the
choice of dual set estimators by which to induce the confidence metameasure. Extending the
statistical model to incorporate information from the physics of experimental design and measurement
can rule out many pathological set estimators as meaningless (McCullagh 2002). For instance, the
inclusion of transformation-group structure in the model leads to set estimators that exactly match
Bayesian posterior credible sets under certain improper priors (Fraser 1968; Helland 2004). Without
taking advantage of extended models, Barndorff-Nielsen and Cox (1994, pp. 121-122, 132-133), Sprott
(2000, pp. 75-76), and Brazzale et al. (2007) highlight advantages of incorporating information from
the likelihood function into set estimators; cf. Section 2.5.
Example 8 (normal distribution). For $n$ independent random variables each distributed according to
$P_{\theta,\gamma}$, the normal distribution with mean $\theta$ and variance $\gamma$, the interval estimator $\hat\Theta^\alpha$ given by

$$\hat\Theta^\alpha(x; \rho) = \left[ p_x^{-1}(\alpha),\ p_x^{-1}(\rho + \alpha) \right]$$

for all $\rho \in [0, 1 - \alpha]$ is nested and is an exact $\rho$ (100%) confidence interval for $\theta$, where $\alpha \in [0, 1]$, $p_x(\theta')$
is the upper-tailed p-value of the hypothesis that $\theta = \theta'$, and $p_x^{-1}$ is the inverse of $p_x$. Since $\hat\Theta^\alpha$ is both
valid and nonconservative, it is dual to itself, yielding the equality of the valid and nonconservative
confidence measures $P^x_{\alpha,\ge}$ and $P^x_{\alpha,\le}$, each the distribution of

$$\hat\theta + T_{n-1}\, \hat\sigma / \sqrt{n},$$

where $T_{n-1}$ is the random variable of the Student t distribution with $n - 1$ degrees of freedom. Hence,
the confidence metameasure $\mathcal{P}^x_\alpha$ induced by $\hat\Theta^\alpha$ is degenerate:

$$\left(\Theta, \mathcal{A}^x, \{P^x_{\alpha,\ge}, P^x_{\alpha,\le}\}\right) = \left(\Theta, \mathcal{A}^x, \{P^x_\alpha\}\right).$$

If $\Theta'$ is an interval, then

$$P^x_\alpha(\Theta') = p_x(\sup \Theta') - p_x(\inf \Theta')$$

for all $x \in \Omega$ and $\Theta' \in \mathcal{A}^x$, from which it follows that the confidence measure does not depend on
the nested set estimator chosen and can thus be represented by $P^x$.
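
The difference of p-values for an interval hypothesis in the normal example can be sketched numerically. The code below is an illustration under stated assumptions: the upper-tailed p-value is taken to be the Student t probability of a statistic at least as large as $(\bar{x} - \theta')/(s/\sqrt{n})$, with $\bar{x}$ the sample mean and $s$ the sample standard deviation, and the t distribution function is approximated by Simpson's rule so that only the standard library is needed. The function names and the data are hypothetical.

```python
import math
from statistics import mean, stdev
from typing import Sequence


def student_t_cdf(t: float, df: int, steps: int = 2000) -> float:
    """Student t distribution function by Simpson's rule (steps must be even)."""
    norm = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))

    def density(u: float) -> float:
        return norm * (1 + u * u / df) ** (-(df + 1) / 2)

    h = abs(t) / steps
    area = density(0.0) + density(abs(t))
    for i in range(1, steps):
        area += (4 if i % 2 else 2) * density(i * h)
    area *= h / 3  # integral of the density from 0 to |t|
    return 0.5 + area if t >= 0 else 0.5 - area


def upper_tail_p_value(x: Sequence[float], theta0: float) -> float:
    """Upper-tailed p-value of the hypothesis that the mean equals theta0."""
    n = len(x)
    t_stat = (mean(x) - theta0) / (stdev(x) / math.sqrt(n))
    return 1.0 - student_t_cdf(t_stat, n - 1)


def confidence_level(x: Sequence[float], lower: float, upper: float) -> float:
    """Confidence level of the interval hypothesis lower <= theta <= upper."""
    return upper_tail_p_value(x, upper) - upper_tail_p_value(x, lower)


data = [0.0, 2.0]  # hypothetical sample: mean 1, standard error 1, one degree of freedom
print(confidence_level(data, 0.0, 2.0))
```

Because the p-value is increasing in the hypothesized mean, the difference is nonnegative, and widening the interval can only raise the confidence level.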

Special properties of degenerate confidence metameasures are given in Section 3. The next example
involves a nondegenerate confidence metameasure.

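Example 9 (binomial distribution). Let $P_\theta$ denote the binomial measure with $n$ trials, success
probability $\theta \in \Theta$, and $\zeta$-corrected, upper-tail cumulative probabilities $p_{\zeta,x}(\theta) = P_\theta(X > x) +
\zeta P_\theta(X = x)$, with $\zeta \in [0, 1]$. Consider the family $\mathcal{T}_\zeta = \{\hat\Theta^\alpha_\zeta : \alpha \in [0, 1]\}$ of nested set estimators
such that

$$\hat\Theta^\alpha_\zeta(x; \rho) = \begin{cases} \left[ p^{-1}_{1-\zeta,x}(\alpha),\ p^{-1}_{\zeta,x}(\alpha + \rho) \right] & \rho \in (0, 1 - \alpha] \\ [0, 1] & \rho = 1 \end{cases}$$

for all $\alpha \in [0, 1]$, $\rho \in \mathfrak{R} = [0, 1 - \alpha] \cup \{1\}$, $x \in \{0, 1, \dots\} = \Omega$, where

$$p^{-1}_{\zeta',x}(\alpha') = \theta' \iff p_{\zeta',x}(\theta') = \alpha'. \tag{5}$$

Since the rates at which the valid ($\zeta = 0$) and nonconservative ($\zeta = 1$) interval estimators cover $\theta$ are
bound according to

$$P_\theta\left(\theta \in \hat\Theta^\alpha_0(X; \rho)\right) \ge \rho,$$
$$P_\theta\left(\theta \in \hat\Theta^\alpha_1(X; \rho)\right) \le \rho,$$

the sets $\mathcal{T}_0$ and $\mathcal{T}_1$ are valid and nonconservative families of nested set estimators, respectively, and for
any $\alpha \in [0, 1]$, the valid set estimator $\hat\Theta^\alpha_0$ is dual to the nonconservative set estimator $\hat\Theta^\alpha_1$, thus inducing
the valid confidence measure $P^x_{\alpha,0}$, the nonconservative confidence measure $P^x_{\alpha,1}$, and the confidence
metameasure $\mathcal{P}^x_\alpha$ on the $\sigma$-field $\mathcal{B}([0, 1])$ for each $x \in \Omega$. In order to weigh evidence in $X = x$ for the
hypothesis that $0 \le \theta' \le \theta \le \theta'' \le 1$, equation (2) furnishes

$$P^x_{\alpha,\zeta}\left( \left[ p^{-1}_{1-\zeta,x}(\alpha),\ p^{-1}_{\zeta,x}(\alpha + \rho) \right] \right) = \rho,$$

and, with equation (3),

$$P^x_{\alpha,\zeta}\left( [\theta', \theta''] \right) = P^x_{\alpha,\zeta}\left( \left[ p^{-1}_{1-\zeta,x}(\alpha),\ p^{-1}_{\zeta,x}(\alpha + p''_{\zeta,x}) \right] \right) - P^x_{\alpha,\zeta}\left( \left[ p^{-1}_{1-\zeta,x}(\alpha),\ p^{-1}_{\zeta,x}(\alpha + p'_{\zeta,x}) \right] \right) = p''_{\zeta,x} - p'_{\zeta,x}, \tag{6}$$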
[Figure 1 (plot): confidence level against number of trials (axis ticks at 5, 10, 20) for binomial frequency parameter 2/3. Legend: "<" based on nonconservative CI; ">" based on valid CI; "h" based on half-correction; mean over convex set.]



Figure 1: Confidence levels of the hypothesis that $\theta$, the limiting relative frequency of successes, is
between 1/4 and 3/4 as a function of $n$, the number of independent trials, with $\theta = 2/3$ as the unknown
true value. In the notation of Example 9, the nonconservative confidence level is $P^x_1([1/4, 3/4])$, the
valid confidence level is $P^x_0([1/4, 3/4])$, and the half-corrected level is $P^x_{1/2}([1/4, 3/4])$. The confidence
level averaged over the convex set is defined in Section 3.1. Sampling variation was suppressed by
setting each number $x$ of successes to the lowest integer greater than or equal to $n\theta$ instead of randomly
drawing values of $x$ from the $(n, \theta)$ binomial distribution.



where

$$p'_{\zeta,x} = p_{\zeta,x}(\theta') - \alpha,$$
$$p''_{\zeta,x} = p_{\zeta,x}(\theta'') - \alpha.$$

Since $\alpha$ drops out of the difference, let $P^x_\zeta = P^x_{\alpha,\zeta}$. For any $\Theta' \in \mathcal{B}([0, 1])$, equations (4) and (6)
specify the confidence metalevel of the hypothesis that $\theta \in \Theta'$. To illustrate the reduction of confidence
indeterminacy with additional observations, the boundary values of $\mathcal{P}^x([1/4, 3/4])$ are plotted against
$n$ in Fig. 1 for the $\theta = 2/3$ case.
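
The confidence metalevel of the binomial example can be sketched numerically. The code below is a minimal illustration, assuming that the confidence level of an interval is the difference of the corrected upper-tail probabilities at its endpoints, with correction weight 0 for the valid level and 1 for the nonconservative level as stated in the example. The function names are hypothetical, and the number of successes is set to the smallest integer at or above two thirds of the number of trials, as described in the caption of Figure 1.

```python
import math
from fractions import Fraction
from typing import Tuple


def binom_pmf(k: int, n: int, theta: float) -> float:
    """Binomial probability of k successes in n trials."""
    return math.comb(n, k) * theta ** k * (1 - theta) ** (n - k)


def corrected_upper_tail(x: int, n: int, theta: float, zeta: float) -> float:
    """P(X > x) + zeta * P(X = x) under success probability theta."""
    above = sum(binom_pmf(k, n, theta) for k in range(x + 1, n + 1))
    return above + zeta * binom_pmf(x, n, theta)


def confidence_level(x: int, n: int, lower: float, upper: float, zeta: float) -> float:
    """Confidence level of lower <= theta <= upper for a given correction weight."""
    return corrected_upper_tail(x, n, upper, zeta) - corrected_upper_tail(x, n, lower, zeta)


def confidence_metalevel(x: int, n: int, lower: float, upper: float) -> Tuple[float, float]:
    """Interval spanned by the valid (zeta=0) and nonconservative (zeta=1) levels."""
    valid = confidence_level(x, n, lower, upper, 0.0)
    nonconservative = confidence_level(x, n, lower, upper, 1.0)
    return min(valid, nonconservative), max(valid, nonconservative)


for n in (5, 10, 20):
    x = math.ceil(Fraction(2, 3) * n)  # smallest integer >= 2n/3, computed exactly
    lo, hi = confidence_metalevel(x, n, 0.25, 0.75)
    half = confidence_level(x, n, 0.25, 0.75, 0.5)
    print(n, x, round(lo, 3), round(half, 3), round(hi, 3))
```

Because the corrected tail probability is linear in the correction weight, the half-corrected level computed here is the midpoint of the metalevel.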



2.3 Coherence of confidence metalevels 

The confidence metameasure $\mathcal{P}^x$ on confidence metameasure space $\mathcal{M}^x_{\ge,\le}$ models the reasoning process of an ideal
agent betting on inclusion of the true parameter value in elements of $\mathcal{A}^x$, the $\sigma$-field of $\mathcal{M}^x_{\ge,\le}$, with
upper and lower betting odds determined by the coverage rates of the corresponding valid and nonconservative
confidence sets. The coherence of the agent's decisions may be evaluated by expressing
its betting odds in terms of upper and lower probabilities that lack the additivity property of Kolmogorov's
probability measures. Given the dual functions $u : \mathcal{A}^x \to [0, 1]$ and $v : \mathcal{A}^x \to [0, 1]$ such
that

$$u(\Theta') + v(\Theta \setminus \Theta') = 1, \tag{7}$$
$$u(\Theta' \cup \Theta'') \ge u(\Theta') + u(\Theta''),$$
$$v(\Theta' \cup \Theta'') \le v(\Theta') + v(\Theta'')$$

for all disjoint $\Theta'$ and $\Theta''$ in $\mathcal{A}^x$, the values $u(\Theta')$ and $v(\Theta')$ are the lower and upper probabilities
(Molchanov 2005, §9.3) of the hypothesis that $\theta \in \Theta'$. The decision-theoretic interpretation is that
$u(\Theta')$ is the largest price an agent would pay for a gain of $1_{\Theta'}(\theta)$, whereas $v(\Theta')$ is the smallest price
for which the same agent would sell that gain, assuming an additive utility function (Walley 1991).
The duality between $u$ and $v$ expressed as equation (7) means each function is completely determined
by the other.

The function $u$ is called the lower envelope of a family $\mathfrak{P}$ of measures on $\mathcal{A}$ if

$$u(\Theta') = \inf_{P \in \mathfrak{P}} P(\Theta')$$

for all $\Theta' \in \mathcal{A}$ (Coletti and Scozzafava 2002, §15.2; Molchanov 2005, §9.3). Since the lower envelope
of a family of probability measures is a coherent lower probability (Walley 1991, §3.3.3; Molchanov
2005, §9.3) and since $\{P^x_\ge, P^x_\le\}$ as specified in Definition 6 constitutes such a family, the agent
weighing evidence for any hypothesis $\theta \in \Theta'$ by $\mathcal{P}^x(\Theta')$, with $\Theta' \in \mathcal{A}^x$, satisfies the minimal set of
rationality axioms of Walley (1991). It follows that the agent avoids sure loss by making decisions
according to the lower and upper probabilities

$$u(\Theta') = P^x_\ge(\Theta') \wedge P^x_\le(\Theta'),$$
$$v(\Theta') = 1 - u(\Theta \setminus \Theta').$$
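
The lower-envelope construction and its properties can be checked by brute force on a small example. The sketch below assumes a hypothetical three-point parameter space and two hypothetical probability measures standing in for the valid and nonconservative confidence measures. It forms the lower probability as the minimum over the family, obtains the upper probability by duality, and verifies superadditivity and subadditivity over all pairs of disjoint events.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List

Measure = Dict[str, float]


def prob(P: Measure, event: FrozenSet[str]) -> float:
    return sum(P[point] for point in event)


def lower(family: List[Measure], event: FrozenSet[str]) -> float:
    """Lower envelope: the smallest probability any member assigns to the event."""
    return min(prob(P, event) for P in family)


def upper(family: List[Measure], event: FrozenSet[str], space: FrozenSet[str]) -> float:
    """Upper probability obtained from the lower one by duality."""
    return 1.0 - lower(family, space - event)


def satisfies_envelope_inequalities(family: List[Measure], space: FrozenSet[str],
                                    tol: float = 1e-12) -> bool:
    """Check superadditivity of lower and subadditivity of upper on disjoint events."""
    points = sorted(space)
    events = [frozenset(c) for r in range(len(points) + 1) for c in combinations(points, r)]
    for a in events:
        for b in events:
            if a & b:
                continue  # only disjoint pairs are constrained
            if lower(family, a | b) < lower(family, a) + lower(family, b) - tol:
                return False
            if upper(family, a | b, space) > upper(family, a, space) + upper(family, b, space) + tol:
                return False
    return True


# Hypothetical stand-ins for the valid and nonconservative confidence measures.
P_valid = {"low": 0.2, "mid": 0.5, "high": 0.3}
P_nonconservative = {"low": 0.1, "mid": 0.6, "high": 0.3}
family = [P_valid, P_nonconservative]
space = frozenset(P_valid)

print(satisfies_envelope_inequalities(family, space))
print(lower(family, frozenset({"low"})), upper(family, frozenset({"low"}), space))
```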

Conversely, the framework of Section 2.2 can be presented starting with de Finetti's prevision and
the related concept of coherent extension (Walley 1991; Coletti and Scozzafava 2002) as follows. An
intelligent agent first sets its prices for buying and selling gambles on the hypotheses corresponding
to the elements of $\mathcal{C}$ according to the confidence coefficients of valid and nonconservative nested set
estimators. Then it extends its prices or previsions to the family of the two probability measures on
the $\sigma$-field induced by $\mathcal{C}$ in order to evaluate the probability of a hypothesis $\theta \in \Theta'$ for some $\Theta'$ in the
$\sigma$-field but not in $\mathcal{C}$. This family in turn yields coherent lower and upper probabilities that equal the
initial buying and selling prices whenever the latter apply, i.e., when the hypothesis is that $\theta \in \Theta'$ for
some $\Theta' \in \mathcal{C}$. Thus, a Dutch book cannot be made against the agent.



2.4 Decisions under arbitrary loss 

This section generalizes betting under 0-1 loss to making confidence-based decisions under any unbounded
loss function. Confidence metalevels do not describe the actual betting behavior of any
human agent, but instead prescribe decisions, including amounts bet on any hypothesis involving $\theta$,
given that the agent will incur a loss of $L_a(\theta)$ for taking action $a$.

According to a natural generalization of the Bayes decision rule of minimizing loss averaged over a
posterior distribution, action $a'$ dominates (is rationally preferred to) action $a''$ if and only if

$$\forall\, \ell' \in \mathcal{E}(L_{a'}),\ \ell'' \in \mathcal{E}(L_{a''}) : \ell' \le \ell'',$$

$$\exists\, \ell' \in \mathcal{E}(L_{a'}),\ \ell'' \in \mathcal{E}(L_{a''}) : \ell' < \ell'',$$

where both expectation intervals (Definition 2) are with respect to the same confidence metameasure
$\mathcal{P}^x$. The confidence metameasures impose no restrictions on agent decisions other than restricting them
to non-dominated actions.
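
The dominance rule can be sketched for actions whose expected losses are summarized by closed intervals. The code below is an illustration under assumptions: each expectation interval is represented by a hypothetical pair of endpoints, "every expected loss of the first action is at most every expected loss of the second" is checked by comparing the first interval's upper endpoint with the second's lower endpoint, and "some expected loss is strictly smaller" by comparing the first lower endpoint with the second upper endpoint.

```python
from typing import Dict, List, Tuple

Interval = Tuple[float, float]  # (lowest expected loss, highest expected loss)


def dominates(first: Interval, second: Interval) -> bool:
    """True if the action with interval `first` dominates the action with `second`."""
    first_lo, first_hi = first
    second_lo, second_hi = second
    never_worse = first_hi <= second_lo      # every loss in first <= every loss in second
    sometimes_better = first_lo < second_hi  # some loss in first < some loss in second
    return never_worse and sometimes_better


def non_dominated(actions: Dict[str, Interval]) -> List[str]:
    """Names of the actions that no other action dominates."""
    return [
        name
        for name, interval in actions.items()
        if not any(dominates(other, interval)
                   for other_name, other in actions.items() if other_name != name)
    ]


# Hypothetical expectation intervals for three actions.
actions = {"a": (1.0, 2.0), "b": (2.0, 3.0), "c": (1.5, 2.5)}
print(non_dominated(actions))
```

In this example the first action dominates the second, while the first and third overlap and neither dominates the other, so the rule leaves the choice between them open.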

This use of the confidence metameasure in making decisions follows a previous generalization of
maximizing expected utility to multi-valued probability. (Here, the utilities are expressed in terms
of equivalent losses, as is conventional in the statistics literature.) Kyburg (1990, pp. 180, 231-234;
2003; 2006) and Kaplan (1996, §1.4) used the principle of dominance to make decisions on the basis
of intervals of expected utilities determined by the expected utility of each probability measure: an
action yielding expected utilities in interval A is preferred to that yielding expected utilities in interval
B if at least one member of A is greater than all members of B and if no member of A is less than
any member of B.
While multi-valued probabilities do not dictate how to choose one of the non-dominated actions in
situations that demand a choice equivalent to deciding between accepting a hypothesis or accepting its
alternative, they may prove more practical when indecision can be broken by additional considerations,
as Walley (1991, pp. 161-162, 235-241) explained. In the case of a human agent, Kyburg (2003) argued
for selecting among non-dominated actions on the basis of considerations that cannot be represented
mathematically rather than selecting on the basis of an arbitrary prior distribution.

If a single-valued estimate of $1_{\Theta'}(\theta)$ is needed for some $\Theta' \in \mathcal{A}$, the indeterminacy $\sup \mathcal{P}^x(\Theta') -
\inf \mathcal{P}^x(\Theta')$ can quantify a set estimator's degree of undesirable conservatism; some ways to eliminate
such indeterminacy by replacing a confidence metameasure with a confidence measure are mentioned
in Section 3. If indeterminacy is removed, the above dominance principle reduces to the principle of
minimizing expected loss (§3.2).



2.5 Likelihood principle 

While in some cases the likelihood function can guide the construction of set estimators with desirable
properties, as noted in Section 2.2, it plays no general role in confidence decision theory. Consequently,
inference does not always obey the likelihood principle: some set estimators lead to values of evidential
support and partial proof that depend on information in the sampling model not encoded in the
likelihood function; cf. Wilkinson (1977).



An advantage of coherent statistical methods in general is the flexibility they give the researcher
to simultaneously consider as many hypotheses and interval estimates as desired. Although such
versatility is usually presented as a consequence of the likelihood principle and Bayesian statistics,
they are not needed to secure it once coherence has been established (§2.3).

That the proposed framework is not constrained by the likelihood principle distinguishes it from
Peter Walley's W1 and W2, two inferential theories of indeterminate (multi-valued) probability intended
to satisfy the best aspects of both coherence and frequentism (Walley 2002). The coverage error rate
of W1 tends to be much higher than the nominal rate in order to ensure simultaneous compliance
with the likelihood principle. Although the principle often precludes approximately correct frequentist
coverage, more power [...] (Walley 2002). Walley (2002) did not report the degree of conservatism of
W2, a normalized likelihood method. With a uniform measure for integration over parameter space,
the normalized likelihood is equal to the Bayesian posterior that results from a uniform prior.



3 Frequentist posterior distribution 



An important realm for practical applications of the above framework is the situation in which inference
may reasonably depend only on a single confidence measure $P^x$ rather than directly on a confidence
metameasure $\mathcal{P}^x$. That is possible not only in the special case of degeneracy due to the availability of a
suitable exact nested set estimator (Example |sj), but can also be achieved either by transforming a
nondegenerate confidence metameasure to a confidence measure (§3.1) or by approximating a confidence
measure. Remark 16 concerns the latter strategy in the case of a scalar parameter of interest.

Relying solely on a single confidence measure for inference and decision making (§3.2) enjoys the
coherence of theories of utility maximization usually associated with Bayesianism (§3.3). In the
ubiquitous special case of a scalar parameter of interest, a single confidence level of a hypothesis is a
consistent estimator of whether the hypothesis is true under more general conditions than is the p-value
as such an estimator (§3.4).



3.1 Reducing a confidence metameasure 

Interpreting upper and lower probabilities as bounds defining a family of permissible probability
measures, Williamson (2007) argued for minimizing expected loss with respect to a single distribution
within the family instead of using outside considerations to choose among actions that are
non-dominated in the sense of Section 2.4. Consider the confidence metameasure space
$(\Theta, \mathcal{A}, \{P^x_\ge, P^x_\le\})$ of confidence metameasure $\mathcal{P}^x$ for some $x \in \Omega$. A much larger family of
measures on $\mathcal{A}$ such that $\{P^x_\ge, P^x_\le\}$ and $\mathfrak{P}$ have the same lower envelope $u$ is the convex set

$$\mathfrak{P} = \{P^x_D : D \in [0, 1]\},$$

where $P^x_D = (1 - D) P^x_\ge + D P^x_\le$, thereby forming the metaprobability space $(\Theta, \mathcal{A}, \mathfrak{P})$ and
probability metameasure $\mathcal{P}^x$; cf. Smith (1961 §11), Wasserman (1990), and Paris (1994 pp. 40-42).
Since [...], the measure $P^x \in \mathfrak{P}$ selected according to some rule is called a reduction of $\mathcal{P}^x$.

Effective reduction of $\mathcal{P}^x$ to a single measure $P^x$ can be accomplished by averaging over $\mathfrak{P}$ with
respect to the Lebesgue measure. That average of the convex set is simply the mean of the valid and
nonconservative confidence measures:

$$P^x(\Theta') = \int_0^1 P^x_D(\Theta') \, dD = \frac{P^x_\ge(\Theta') + P^x_\le(\Theta')}{2} = P^x_{1/2}(\Theta') \tag{8}$$

for all $\Theta' \in \mathcal{A}$; recall that $P^x_{1/2} \in \mathfrak{P}$.



Other automatic methods of reducing a metameasure to a single measure are also available. For
example, the recommendation of Williamson (2007) to select the measure within the family that
maximizes the entropy is minimax under Kullback-Leibler loss (Grünwald 2004).

Example 10 (Binomial distribution, continued from Example 9). As the gray line in Fig. 1 indicates,
the mean measure $P^x$ of the convex set (8) yields a confidence level between those of the valid and
nonconservative confidence measures, discarding the notable reduction in confidence nondegeneracy
from n = 1 to n = 10 as irrelevant for action in situations that do not permit indecision. The
approximate (half-corrected) confidence level also disregards nondegeneracy information, yielding in
this special case the same levels of confidence as does $P^x$. In contrast, the confidence metameasure
records the nondegeneracy as the difference between the agent's selling and buying prices of a gamble
with a payoff contingent on whether or not $\theta \in [1/4, 3/4]$, a difference that becomes less important as
n increases.
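
The averaging in equation (8) can be illustrated numerically for binomial data. The following Python sketch is not taken from the source: it assumes that the two endpoint measures have the CDFs $\theta \mapsto P_\theta(X \ge x)$ and $\theta \mapsto P_\theta(X > x)$, and it does not assert which of the two plays the valid role and which the nonconservative role. All function names are hypothetical.

```python
from math import comb


def binom_pmf(k: int, n: int, theta: float) -> float:
    """Binomial probability of k successes in n trials."""
    return comb(n, k) * theta**k * (1.0 - theta) ** (n - k)


def cdf_geq(theta: float, x: int, n: int) -> float:
    """P_theta(X >= x), read as a CDF in theta (assumed endpoint measure)."""
    return sum(binom_pmf(k, n, theta) for k in range(x, n + 1))


def cdf_gt(theta: float, x: int, n: int) -> float:
    """P_theta(X > x), read as a CDF in theta (assumed endpoint measure)."""
    return sum(binom_pmf(k, n, theta) for k in range(x + 1, n + 1))


def level(cdf, lo: float, hi: float, x: int, n: int) -> float:
    """Confidence level that the measure with the given CDF assigns to [lo, hi]."""
    return cdf(hi, x, n) - cdf(lo, x, n)


def mean_level(lo: float, hi: float, x: int, n: int) -> float:
    """Level under the mean of the two endpoint measures, as in equation (8)."""
    return 0.5 * (level(cdf_geq, lo, hi, x, n) + level(cdf_gt, lo, hi, x, n))


def half_corrected_level(lo: float, hi: float, x: int, n: int) -> float:
    """Level under the half-corrected CDF P(X > x) + P(X = x) / 2."""

    def cdf_half(theta: float, x_: int, n_: int) -> float:
        return cdf_gt(theta, x_, n_) + 0.5 * binom_pmf(x_, n_, theta)

    return level(cdf_half, lo, hi, x, n)
```

Under these assumptions, `mean_level(0.25, 0.75, x, n)` lies between the two endpoint levels and coincides with `half_corrected_level(0.25, 0.75, x, n)`, in line with the remark in Example 10 that the two agree in this special case.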



3.2 Confidence-based decision and inference 

3.2.1 Minimizing expected loss 

In a situation requiring a decision involving the acceptance or rejection of the hypothesis that $\theta \in \Theta'$,
that is, under a 0-1 loss function, an agent guided by a single measure $P^x$ regards
$P^x(\vartheta \in \Theta') / P^x(\vartheta \notin \Theta')$ as the fair betting odds and will act accordingly. The hypothesis $\theta \in \Theta'$ will be
accepted only if the odds $P^x(\vartheta \in \Theta') / P^x(\vartheta \notin \Theta')$ are greater than the ratio of the cost that would be
incurred if $\theta \notin \Theta'$ to the benefit that would be gained if $\theta \in \Theta'$. Otherwise, unless the odds are exactly
equal to 1, the hypothesis $\theta \notin \Theta'$ will be accepted. Under a more general class of loss functions, the
decision theory of Section 2.4 reduces to the minimization of expected loss given the degeneracy or
reduction of the confidence metameasure. Section 3.3.2 notes implications for axiomatic coherence.
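
The acceptance rule just described can be written as a short function. This is a minimal sketch under stated assumptions rather than a procedure given in the source: the function name, argument names, and the "indifferent" outcome when the odds exactly equal the cost-to-benefit ratio are all illustrative choices.

```python
def decide(conf_level: float, cost_if_false: float, benefit_if_true: float) -> str:
    """Accept or reject the hypothesis from a single confidence level.

    conf_level      : P^x(theta in Theta'), a number in [0, 1]
    cost_if_false   : loss incurred if the hypothesis is accepted but false
    benefit_if_true : gain obtained if the hypothesis is accepted and true
    """
    if not 0.0 <= conf_level <= 1.0:
        raise ValueError("conf_level must lie in [0, 1]")
    if cost_if_false <= 0.0 or benefit_if_true <= 0.0:
        raise ValueError("cost and benefit must be positive")
    # Comparing odds p / (1 - p) with cost / benefit is equivalent to comparing
    # p * benefit with (1 - p) * cost, which avoids division by zero at p = 1.
    expected_gain = conf_level * benefit_if_true
    expected_cost = (1.0 - conf_level) * cost_if_false
    if expected_gain > expected_cost:
        return "accept"
    if expected_gain < expected_cost:
        return "reject"
    return "indifferent"
```

With equal cost and benefit, which corresponds to 0-1 loss, the rule accepts exactly when the confidence level exceeds 1/2; a cost nine times the benefit raises the required confidence level to more than 0.9.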



3.2.2 Applications to hypothesis assessment 

As the findings of basic science are arguably valuable even if never applied and since the ways in
which any inductive inference will be used are often unpredictable (Fisher 1973 pp. 95-96, 103-106),
$P^x(\vartheta \in \Theta')$ may be reported as an estimate of $1_{\Theta'}(\theta)$ for use with currently unknown loss functions
(cf. Jeffrey 1986; Hwang 1992). That inferential role is currently played in many of the sciences by
the p-value interpreted as a measure of evidence in "significance testing" (Cox 1977), but its notorious
lack of coherence has prevented its universal acceptance (e.g., Royall 1997). As will become clear in
Section 3.4, $P^x(\vartheta \in \Theta')$ can differ markedly from the p-value for testing $\theta \in \Theta'$ as the null hypothesis
not only in interpretation but also in numeric value.

Example 11. Efron and Tibshirani (1998 §3) consider the hypothesis that the mean $\xi$ of a
$\nu$-dimensional multivariate normal distribution of an identity covariance matrix is in an origin-centered
sphere of radius $\theta''$ but outside a concentric sphere of radius $\theta'$. Let $\theta = \|\xi\|$, and let $\chi^2_\nu$ be the
chi-squared cumulative distribution function (CDF) of $\nu$ degrees of freedom. Since the p-value of the
null hypothesis that $\theta > \theta'$ is $\chi^2_\nu((\|x\|/\theta')^2)$, the confidence level of the hypothesis that $\theta' < \theta < \theta''$
is

$$P^x(\theta' < \vartheta < \theta'') = \chi^2_\nu\left((\|x\|/\theta')^2\right) - \chi^2_\nu\left((\|x\|/\theta'')^2\right),$$

the value of which Efron and Tibshirani (1998 §4) justified as an approximation to a Bayesian posterior
probability. The coherence of the confidence measure $P^x$ immunizes it against the inconsistencies that
Efron and Tibshirani (1998 §3) noticed among p-values: contradictory conclusions would be reached
depending on which hypothesis was considered as the null.

A practical implication of working in the confidence metameasure framework is that since the simple
bootstrap methods of Efron and Tibshirani (1998) based on a scalar pivot enable close approximations
to p-value functions (Efron 1993; Schweder and Hjort 2002; Singh et al. 2005; Xiong and Mu 2009),
they can solve related problems too complex for more rigid Neyman-Pearson methods and yet without
any need to seek matching priors for justification; cf. Efron (2003). Applications include assigning
levels of confidence to phylogenetic tree branches (Efron et al. 1996), to observed local maxima in an
estimated function (Efron and Tibshirani 1998; Hall 2004), and to gene network connections found on
the basis of microarray data (Kamimura et al. 2003). Liu (1997) studied operating characteristics of
the empirical strength probability (ESP), which in the one-dimensional case is equal to some confidence
probability $P^x(\theta' < \vartheta < \theta'')$ defined with respect to a bootstrap algorithm.

See Polansky (2007) for an accessible introduction to the general problem of "observed confidence
levels" of composite hypotheses, which Efron and Tibshirani (1998) had dubbed the "problem of
regions," understood to include applications to ranking and selection as well as those mentioned above.
The fundamental characteristic of this approach is not the bootstrapping technique as much as the
property that the level of confidence in any given region is equal to the coverage rate of a corresponding
confidence set. Until the ESP is seen to have a compelling justification of its own, it may continue
to be regarded merely as a method of last resort since it is in general neither a Bayesian posterior
probability nor a Neyman-Pearson p-value: "For [the latter] reason, it seems best to use the ESP only
when more specific, direct testing methods are not available for a particular problem" (Davison et al.
2003). That the ESP and other approximations of the confidence value are more acceptable than
p-values as estimates of whether the parameter lies in a given region (§3.4.4) gives cause to reconsider
that judgment even apart from the coherence of the confidence value.

Example 12 (beyond statistical significance). Consider the null hypothesis $\theta' - \Delta \le \theta \le \theta' + \Delta$,
where the non-negative scalar $\Delta$ is a minimal degree of practical or scientific significance in a particular
application. For instance, researchers developing methods of analyzing microarray data are increasingly
calling for specification of a minimal level of biological significance when testing null hypotheses of
equivalent gene expression against alternative hypotheses of differential gene expression (Lewin et al.
2006) [...] Van De Wiel and Kim (2007) and McCarthy and Smyth (2009) in [...] hypotheses, in
conflict with the confidence measure approach (Example 18 and Section 3.4.4).



Section 3.4.3 provides additional examples that contrast hypothesis confidence levels with hypothesis
p-values in practical applications.

3.2.3 Other applications of minimizing expected loss 



The framework of minimizing expected loss with respect to a confidence measure (§3.2.1) not only
leads to assigning confidence levels to hypotheses but also provides methods for optimal estimation
and prediction. In addition, confidence-measure estimators and predictors have frequentist properties
that Bayesian estimators and predictors share only when the Bayesian posterior is a confidence
measure.

As the frequentist posterior, the confidence measure gives all the point estimators provided by
the Bayesian posterior. For example, the frequentist posterior mean, minimizing expected squared
error loss, is $\bar{\vartheta}_x = \int_\Theta \vartheta \, dP^x(\vartheta)$, and the frequentist posterior p-quantile, minimizing expected loss for a
threshold-based function of p (Carlin and Louis 2009 App. B), is $\vartheta(p)$ such that $p = P^x(\vartheta < \vartheta(p))$.
Assuming a differentiable CDF of $P^x$, Singh et al. (2007) proved the weak consistency of the frequentist
posterior median $\vartheta(1/2)$ and the frequentist posterior mean $\bar{\vartheta}_x$ and proved that the former is
median-unbiased. In that case, the frequentist mode, the value maximizing the probability density function of
$\vartheta$, is also available if a unique maximum exists.
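
The point estimators above can be computed from any confidence CDF. The sketch below is an illustration under an assumed model, not an example from the source: it takes the confidence measure for a normal mean with known standard error, whose CDF is $\Phi((\theta - \bar{x})/\mathrm{se})$, and recovers the frequentist posterior median by bisection and the frequentist posterior mean by a Riemann-Stieltjes sum. The function names and the integration grid are hypothetical.

```python
from statistics import NormalDist


def make_confidence_cdf(xbar: float, se: float):
    """Confidence CDF of a normal mean with known standard error (assumed model)."""
    return NormalDist(mu=xbar, sigma=se).cdf


def quantile(cdf, p: float, lo: float, hi: float, tol: float = 1e-10) -> float:
    """Bisection for the theta with cdf(theta) = p; cdf must be continuous and increasing."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if cdf(mid) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def posterior_mean(cdf, lo: float, hi: float, steps: int = 20000) -> float:
    """Approximate the integral of theta dP^x(theta) over [lo, hi] on a uniform grid."""
    width = (hi - lo) / steps
    total = 0.0
    previous = cdf(lo)
    for i in range(1, steps + 1):
        right = lo + i * width
        current = cdf(right)
        # Weight the midpoint of each cell by the confidence mass of the cell.
        total += (right - 0.5 * width) * (current - previous)
        previous = current
    return total
```

For this symmetric confidence measure the median and the mean both equal the observed $\bar{x}$; for a skewed confidence CDF the same two functions would return different values.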

The frequentist posterior predictive distribution, the frequentist analog of the Bayesian posterior
predictive distribution of a new observation of X, is $P^{\langle x \rangle} = \int_\Theta P_\vartheta \, dP^x(\vartheta)$ for all $x \in \Omega$. (Dawid and
Wang (1993), van Berkum (1996), and Hannig (2009) considered this with fiducial-like distributions in
place of the confidence measure $P^x$.) Appropriate point predictions are $\bar{x}_x = \int_\Omega X(\omega) \, dP^{\langle x \rangle}(\omega)$ in the
"regression" case of continuous $\Omega$ and $\hat{x}_x = 1_{[1/2,1]}(P^{\langle x \rangle}(X = 1))$ in the "classification" case in which
$\Omega = \{0, 1\}$. If $P^x$ is approximated using a bootstrap algorithm as in Section 3.2.2, then the resulting
values of $\bar{x}_x$ and $\hat{x}_x$ are bootstrap aggregation (bagging) predictions; Breiman (1996) found bagging
to reduce prediction error. The confidence predictive distribution can also be used to determine sizes
of new studies by accounting for uncertainty in the effect size. (The classical method of determining
the sample size of a planned experiment is often criticized for relying on a point estimate of the effect
size.)
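
A deliberately simple instance of the "classification" prediction is sketched below. It assumes binary observations and uses the sample proportion in each bootstrap resample as the plug-in predictor of $P(X = 1)$; averaging those predictions over resamples is the aggregation step. The function names, the number of resamples, and the seed are hypothetical, and the sketch makes no claim about how closely a particular bootstrap scheme approximates $P^x$.

```python
import random


def bagged_probability(data, n_boot: int = 2000, seed: int = 0) -> float:
    """Average, over bootstrap resamples, of the plug-in estimate of P(X = 1)."""
    rng = random.Random(seed)
    n = len(data)
    total = 0.0
    for _ in range(n_boot):
        # Draw n observations with replacement from the original sample.
        resample = [data[rng.randrange(n)] for _ in range(n)]
        total += sum(resample) / n
    return total / n_boot


def bagged_class(data, n_boot: int = 2000, seed: int = 0) -> int:
    """Predict 1 when the aggregated probability lies in [1/2, 1], else 0."""
    return 1 if bagged_probability(data, n_boot, seed) >= 0.5 else 0
```

With eight ones and two zeros the aggregated probability is close to 0.8, so the predicted class is 1; richer predictors fitted to each resample would follow the same pattern of fitting, predicting, and averaging.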



3.3 Confidence versus Bayesian probability 



As the examples of Section 3.3 illustrate, many uses of Bayesian posterior distributions are completely
compatible with confidence measures since both distributions of parameters deliver coherent inferences
in the form of probabilities that hypotheses of interest are true. However, to the extent that updating
parameter distributions in agreement with valid confidence intervals conflicts with updating them
by Bayes's formula, confidence decision theory differs fundamentally from the two dominant forms
of Bayesianism, "subjective" Bayesianism, which is seldom used by the statistics community, and
"objective" Bayesianism broadly defined as a collection of algorithms for generating prior distributions
from sampling distributions or from invariance arguments. Nonetheless, the proposed framework
follows from an application of de Finetti's theory of prevision to an agent that makes decisions according
to certain confidence levels (§2.3).



3.3.1 Bayesian conditioning 



As demonstrated in Section 2.3, the proposed framework for frequentist inference satisfies coherence,
which does not require the probability distribution of the parameters to correspond to any Bayesian
posterior distribution, a prior distribution conditional on the observed data in the Kolmogorov sense,
as is frequently supposed. Not coherence but another pillar of Bayesianism mandates that the posterior
distribution, i.e., the parameter distribution used for decisions after making an observation, must equal
the prior distribution conditioned on the observation (Goldstein 1985). That assumption, usually
implicit, has been stated as a plausible principle of learning from data:

Definition 13 (Bayesian temporal principle). Consider the prior distribution $\pi$, a probability measure
induced by a random vector $\vartheta$ in $\Theta$, the parameter space. Let the update rule $\pi'_\bullet$ denote a function
mapping $\Omega$, the sample space, to a set of probability measures, each defined on $\Theta$. If, for all $x' \in \Omega$,
the posterior distribution $\pi'_{x'}$, induced by random quantity $\vartheta'_{x'}$, is the conditional distribution of
$\vartheta$ given $X' = x'$, then $\pi'_\bullet$ satisfies the Bayesian temporal principle, $\pi'_{x'}$ is called a Bayesian posterior
distribution, and the equivalence between the posterior and conditional distributions is written as

$$\vartheta'_{x'} = \vartheta \,|\, x'.$$

Remark 14. In the one-dimensional case, the Bayesian temporal principle stipulates that, for all
$\Theta' \subseteq \Theta$,

$$\pi'_{x'}(\vartheta'_{x'} \in \Theta') = \pi(\vartheta \in \Theta' \,|\, X' = x'),$$

where $\pi'_{x'}$ and $\pi$ are the posterior and prior distributions of $\vartheta'_{x'}$ and $\vartheta$, respectively. Adding a prime
symbol (') for each successive observation gives $\vartheta'_{x'} = \vartheta \,|\, x'$, $\vartheta''_{x''} = \vartheta'_{x'} \,|\, x''$, $\vartheta'''_{x'''} = \vartheta''_{x''} \,|\, x'''$, and so
forth. Goldstein (2001) coined the name of the principle, explaining that it unreasonably requires
that an agent's conditional betting odds (prior odds conditional on a contemplated future observation)
determine its future betting odds (posterior odds as a function of the actual observation). In other
words, the current rate of machine learning is limited by the previous strength of machine belief.



Goldstein (2001) pointed out that although Bayesians follow the temporal principle when using
Bayes's formula, they disregard it every time they revise a prior or sampling model upon seeing new
data. Such revision occurs whenever posterior predictions are subjected to frequentist model checking
procedures such as cross validation. One rationale for revising the prior is that poor frequentist
performance may indicate that it did not reflect the available information as well as it
might have, had it been more carefully elicited. Another is the receipt of new information that cannot
be represented in the probability space of the initial prior (Diaconis and Zabell 1982).



3.3.2 Non-Bayesian coherence 



Confidence decision theory not only satisfies coherence in the sense of avoiding sure loss (§2.3), but
when reduced to the minimization of expected loss with respect to a single confidence measure (§3.2)
is also coherent in the sense of axiomatic systems of expected utility maximization (von Neumann
and Morgenstern 1944; Savage 1954). While both approaches to coherence support the concept of
placing bets in accord with the laws of probability, including conditional probability for called-off bets,
neither approach entails the equality of conditional probability as defined by Kolmogorov and
posterior probability as the hypothesis probability updated as a function of observed data. Replacing
probabilities with proposition truth values and conditional probabilities with theorems (statements
of implication) furnishes an illustration from deductive logic (Jeffrey 1986): an agent whose
propositions held to be true do not contradict each other at any point in time is completely
self-consistent. However, the agent cannot comply with the deductive version of the Bayesian temporal
principle unless none of the truth values ever requires revision (Howson 1997). As a finitely additive
probability distribution, the confidence measure also agrees with axiomatic systems of probabilistic
logic such as that of Cox (1961).



The above accounts of coherence provide no support for the Bayesian temporal principle since their
theorems involve conditional probability, not posterior probability as specified by some update rule.
Simply defining the posterior distribution to be Kolmogorov's conditional distribution given the data
either specifies nothing about how parameter distributions are updated with new data or conceals the
assumption of the Bayesian temporal principle (Hacking 1967).



Even though the statistical literature refers to many theorems supporting coherence and rationality
as understood in Section 2.3, discussion of the foundational principle of Bayesianism has instead taken
place mostly in the philosophical literature. David Lewis (Teller 1973) presented a transformation
of the Dutch book game (§2.3) into one in which the gambler knows the rule the casino agent uses
to update its betting odds on receipt of new information. In that game but not in the original
Dutch book game, violation of the Bayesian temporal principle leads the casino to sure loss (Teller
1973; Vineberg 1997). Since such violation occurs over time, it is considered a breach of diachronic
game-theoretic coherence, a restriction on the degree to which an agent's betting odds can change over
time, as opposed to synchronic game-theoretic coherence, a consistency in an agent's betting odds at
any given time (Armendt 1992). Accordingly, the Dutch book arguments for diachronic coherence
have been considered much weaker (Maher 1992; Goldstein 2006; Williamson 2009) than those for
synchronic coherence, the type of coherence supported by the theorems of de Finetti (1970) and Savage
(1954). Goldstein (1997), Hacking (2001 pp. 256-260), and Williamson (2009), while accepting Dutch
book arguments for synchronic coherence, do not consider diachronic coherence to be a requirement of
logical thought. Hild (1998) distinguished game-theoretic diachronic coherence from decision-theoretic
diachronic coherence, arguing that the latter rules out the Bayesian temporal principle as incoherent.
Another difficulty is that some Dutch book arguments lead to versions of diachronic coherence that
conflict with the Bayesian temporal principle (Armendt 1992).



In summary, the theorems routinely presented as proof that all rational thought or coherent
decision making must be Bayesian actually prove no more than the irrationality of violating the logic
of standard probability theory. Thus, any decision-theoretic framework representing unknown values
as random quantities mapped from some probability space stands on equal ground with Bayesianism
as far as the minimal requirements of rationality are concerned. Such frameworks include geometric
conditioning (Goldstein 2001), probability kinematics (Diaconis and Zabell 1982; Jeffrey 2004),
dynamic coherence (Skyrms 1997; Zabell 2002), and relative entropy maximization (Grünwald 2004;
Jaeger 2005; Williamson 2009) as well as confidence decision theory (§3.4).



3.3.3 Objections to frequentist posteriors 

Since, neglecting sufficiency and ancillarity considerations, the confidence level is numerically equal to
the fiducial probability in the case of a one-dimensional parameter of interest given continuous data
(Wilkinson 1977), some classical Bayesian objections against the coherence of fiducial distributions
apply with equal force against the coherence of the confidence measure. The strength of such arguments
is now evaluated in light of the above distinction between axiomatic coherence and the Bayes update
rule.

In the present framework, confidence-based or fiducial probabilities of hypotheses correspond to
reasonable betting odds, a consequence that Cornfield (1969) considered impossible since Lindley (1958)
had demonstrated that fiducial distributions are Bayesian posteriors only in certain special cases and
since placing conditional bets contrary to conditional probability leads to certain loss. The conclusion
drawn by Cornfield (1969) would only follow under the widely held but incorrect assumption that
a parameter distribution must be a Bayesian posterior for it to satisfy coherence. Lindley (1958),
extending the work of Grundy (1956), actually had found conditions under which the fiducial
distribution violates the Bayesian temporal principle considered in Section 3.3, not that a conditional fiducial
distribution is incompatible with the definition of a conditional probability distribution.



Lindley (1958) also demonstrated that violation of the Bayesian temporal principle means the
pivot is not unique, leading to non-unique fiducial distributions. In light of the subsequent failure of a
generation of statisticians to identify any genuinely noninformative priors (Dawid et al. 1973; Walley
1991 pp. 226-235; Kass and Wasserman 1996; Helland 2004), Bayesian posteriors lack uniqueness as
well (Fraser 2008a; Hannig 2009). Just as given a prior, sampling model,
and data, all inferences made using the resulting Bayesian posterior measure are coherent, so given
an exact estimator, sampling model, and data, all inferences made using the resulting confidence or
fiducial measure are equally coherent. Thus, the selection of frequentist set estimators parallels the
selection of priors, and in each case such selection may depend on the intended application. Section
2.2 points to reasonable criteria for such selection.



3.4 Scalar subparameter case 

The equality between tail probabilities of a confidence measure and p-values will be used to prove a
consistency property that holds under more general conditions for a confidence level than for a p-value
as estimators of composite hypothesis truth.

3.4.1 Confidence CDF as the p-value function



If decisions are based on a single confidence measure of a scalar parameter of interest, then the CDF
of that measure is an upper-tailed p-value function.

Definition 15. Consider a function $p^+ : \Omega \times \Theta \to [0, 1]$ such that $p^+(x, \bullet) = p^+_x(\bullet)$ is a CDF for all
$x \in \Omega$ and such that

$$P_\xi\left(p^+_X(\theta) < \alpha\right) = \alpha \tag{9}$$

for all $\theta \in \Theta$, $\xi \in \Xi$, and $\alpha \in [0, 1]$. Then, for any $x \in \Omega$, the map $p^+_x : \Theta \to [0, 1]$ is called an upper-tail
p-value function for $\theta$. Likewise, $p^-_x : \Theta \to [0, 1]$ is called a lower-tail p-value function if

$$p^-_x(\theta) = 1 - p^+_x(\theta) \tag{10}$$

for all $\theta \in \Theta$ and for all $x \in \Omega$.

Uniformly distributed under the simple null hypothesis that $\theta = \theta'$, $p^-_X(\theta')$ and $p^+_X(\theta')$ are exact
p-values of one-sided tests. Since equation (10) is an isomorphism between the two p-value functions,
the pair $(p^-_x(\theta'), p^+_x(\theta'))$ will be called the p-value function, either element of which may be designated
by $p_x(\theta')$. The two-sided p-value of the null hypothesis that $\theta$ is in a central region $\Theta'$ of $\Theta$ is

$$p_x(\Theta') = 2 \sup_{\theta' \in \Theta'} p^-_x(\theta') \wedge p^+_x(\theta')$$

for all $x \in \Omega$, reducing to the usual $p_x(\Theta') = 2 \, p^-_x(\theta') \wedge p^+_x(\theta')$ for the point hypothesis that $\theta = \theta'$.
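
The definitions above can be made concrete with a model chosen purely for illustration. The sketch assumes a normal estimate $\bar{x}$ with known standard error, for which $p^+_x(\theta) = \Phi((\theta - \bar{x})/\mathrm{se})$ satisfies equation (9); the function names are hypothetical. Because the minimum of the two tails rises to 1/2 at the confidence median and falls on either side, the supremum over an interval is attained at the point of the interval nearest that median.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def p_plus(theta: float, xbar: float, se: float) -> float:
    """Upper-tail p-value function, a CDF in theta (assumed normal model)."""
    return STD_NORMAL.cdf((theta - xbar) / se)


def p_minus(theta: float, xbar: float, se: float) -> float:
    """Lower-tail p-value function from equation (10)."""
    return 1.0 - p_plus(theta, xbar, se)


def two_sided_point(theta0: float, xbar: float, se: float) -> float:
    """Two-sided p-value of the point hypothesis theta = theta0."""
    return 2.0 * min(p_minus(theta0, xbar, se), p_plus(theta0, xbar, se))


def two_sided_region(lo: float, hi: float, xbar: float, se: float) -> float:
    """Two-sided p-value of the hypothesis lo <= theta <= hi."""
    # In this model the confidence median is xbar, so the supremum is attained
    # at the point of [lo, hi] closest to xbar.
    nearest = min(max(xbar, lo), hi)
    return two_sided_point(nearest, xbar, se)
```

For $\bar{x} = 1.96$ and unit standard error, the point hypothesis $\theta = 0$ has a two-sided p-value of about 0.05, whereas any interval containing 1.96 has a two-sided p-value of 1.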



While the name p-value function used by Fraser (1991) has become standard in the scientific
literature, significance function is also used in higher-order asymptotics (e.g., Brazzale et al. (2007)).
Efron (1993), Schweder and Hjort (2002), and Singh et al. (2007) prefer the term confidence distribution,
avoided here to clearly distinguish the p-value function from the confidence measure as a Kolmogorov
probability distribution. (Whereas any p-value function is isomorphic to a unique confidence measure
as defined in Section 2.2, the p-value function can also be isomorphic to an incomplete probability
measure. Wilkinson (1977) constructed a theory of incoherence based on such a measure, underscoring
the need to sharply distinguish confidence measures from p-value functions.)

By the usual concept of statistical power, the Type II error rate associated with testing the
false null hypothesis that $\theta = \theta'$ at significance level $\alpha$ is $\beta(\alpha, \theta, \theta') = P_\xi(p_X(\theta') > \alpha)$ for any $\theta \ne \theta'$.
For all $\alpha_1, \alpha_2 \in [0, 1]$ such that $\alpha_1 + \alpha_2 < 1$,

$$P_\xi\left(\alpha_1 < p_X(\theta) < 1 - \alpha_2\right) = 1 - \alpha_1 - \alpha_2,$$

implying that $\theta_x : [0, 1] \to \Theta$, the inverse function of $p_x$, yields $(\theta_x(\alpha_1), \theta_x(1 - \alpha_2))$ as an exact
$100(1 - \alpha_1 - \alpha_2)\%$ confidence interval (Fraser 1991; Efron 1993; Schweder and Hjort 2002; Singh
et al. 2007).
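
The inversion that produces the confidence interval can be checked by simulation. The following sketch again assumes a normal estimate with known standard error, so that the inverse of the p-value function has the closed form $\theta_x(\alpha) = \bar{x} + \mathrm{se}\,\Phi^{-1}(\alpha)$; the function names, the simulation size, and the seed are hypothetical.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()


def theta_x(alpha: float, xbar: float, se: float) -> float:
    """Inverse of the p-value function Phi((theta - xbar) / se), for 0 < alpha < 1."""
    return xbar + se * STD_NORMAL.inv_cdf(alpha)


def confidence_interval(xbar: float, se: float, alpha1: float, alpha2: float):
    """Interval (theta_x(alpha1), theta_x(1 - alpha2)) with level 1 - alpha1 - alpha2."""
    return theta_x(alpha1, xbar, se), theta_x(1.0 - alpha2, xbar, se)


def coverage(theta: float, se: float, alpha1: float, alpha2: float,
             n_sim: int = 20000, seed: int = 1) -> float:
    """Monte Carlo estimate of how often the interval covers the true theta."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sim):
        xbar = rng.gauss(theta, se)  # simulate the estimate under the true theta
        lo, hi = confidence_interval(xbar, se, alpha1, alpha2)
        hits += lo < theta < hi
    return hits / n_sim
```

With $\alpha_1 = 0.05$ and $\alpha_2 = 0.10$ the simulated coverage should be close to the nominal 85%, up to Monte Carlo error.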



Remark 16. In many applications, approximate p-value functions replace those that exactly satisfy
the definition. For instance, Schweder and Hjort (2002) use a half-corrected p-value function like
$p_{c,x}$ of Example 9 for discrete data. Other approximations involve parameter distributions with
asymptotically correct frequentist coverage, including the asymptotic p-value functions of Singh et al.
(2005), the distributions of asymptotic generalized pivotal quantities of Xiong and Mu (2009), some of
the generalized fiducial distributions of Hannig (2009), and the Bayesian posteriors of Section [LT|. As
with frequentist inference in general, asymptotics provide approximations that in many applications
prove sufficiently accurate for inference in the absence of exact results (Reid 2003).



3.4.2 Interpretations of the p-value function 

In its history, the p-value function has had Neymanian, Fisherian, and Bayesian interpretations.
Consistently viewing the p-value function within the Neyman-Pearson framework rather than as the CDF
of a probability measure of $\theta$, Fraser (1991), Schweder and Hjort (2002), Singh et al. (2005), and Singh
et al. (2007) have used $p^+_x$ to concisely present information about hypothesis tests and confidence
intervals in data analysis results. The p-value function thus interpreted as a warehouse of results of
potential hypothesis tests and confidence intervals has also uncovered relationships with the Bayesian
and fiducial frameworks (Schweder and Hjort 2002). Schweder and Hjort (2002) aimed "to
demonstrate the power of the frequentist methodology" by means of reporting on the p-value function and
likelihood function as key components of a unified Neyman-Pearson alternative to Bayesian posterior
distributions, which can fail to yield interval estimates guaranteed to cover the true parameter value at
some given rate. Interestingly, the incipient p-value function had been originally conceived as a
Fisherian alternative to what was seen as a mechanical use of the Neyman-Pearson confidence interval (Cox
1958).



In a move away from both of the main frequentist interpretations of the p-value function, Efron
(1993) proposed a simple, fast algorithm for computing an implied prior density and an implied
likelihood from a confidence density assumed to be proportional to a Bayesian posterior density. He
reported that with a confidence density based on an exponential model and the ABC confidence
interval method, the disagreement between the implied likelihood and the true likelihood observed by
Lindley (1958) "is small in most cases," with the implication that the confidence density approximates
a Bayesian posterior, thereby establishing approximate coherence. However, while compatibility with
a Bayesian posterior is sufficient for coherence, it is by no means necessary (§§2.3, 3.3).

Dropping the requirement of approximating a Bayesian posterior enables more exact frequentist
coverage in many instances without sacrificing the coherence achieved by Efron (1993). The concept of
coherence is itself sufficient to recast the p-value function from a pure Neyman-Pearson toolbox into a
versatile weapon for statistical inference and decision making, enabling all of the applications available
to a Bayesian posterior distribution of the interest parameter, marginal over any nuisance parameters
(cf. Efron 1998).



In addition, information in the form of a subjective prior distribution can be incorporated into
frequentist data analysis by combining the prior with the p-value function (Bickel 2006) under the
following circumstances. Suppose Agents A and B each bases the posterior probability measure by
which it makes decisions (§3.2) on confidence sets according to the framework of Section 2.3 whenever
the observation that X = x constitutes the only information about the parameter of interest. Agent
A observes x, which would yield the confidence measure $P^x$ on $(\Theta, \mathcal{A})$, but it also has independent
information in the form of $Q$, a probability measure on $(\Theta, \mathcal{A})$ elicited from Agent B, where $\Theta \subseteq \mathbb{R}^1$.
Since Agent B would have set $Q$ to equal a confidence measure if possible, Agent A processes $Q$
exactly as it would a confidence measure computed on the basis of data independent of X. Since each
of several methods of combining p-value functions from independent data sets yields an approximate
p-value function incorporating information from both data sets (Singh et al. 2005), Agent A bases its
decisions on $P^x \oplus Q$, the probability measure of the CDF obtained by applying any such combination
method to the CDFs of $P^x$ and $Q$. It follows that if $Q$ is in fact a confidence measure, then $P^x \oplus Q$ is a
confidence measure to the same degree of approximation as the combined CDF is a p-value function.
Agents A and B may actually be the same agent, which would be the case if Agent A had computed
the prior $Q$ as a confidence measure on the basis of independent data that are no longer available. In
conclusion, the presence of important information in the form of a prior probability distribution on
$(\Theta, \mathcal{A})$ does not in itself necessitate moving from confidence-based statistics to Bayesian statistics.
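
The source does not say which combination method is intended for $P^x \oplus Q$. As one possibility, the sketch below assumes a Stouffer-type inverse-normal rule, in which the two CDF values at each $\theta$ are mapped to normal scores, summed, rescaled by $\sqrt{2}$, and mapped back to a probability. The function names and the clipping constant are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def _clip(p: float, eps: float = 1e-12) -> float:
    """Keep p strictly inside (0, 1) so that the inverse normal CDF is defined."""
    return min(max(p, eps), 1.0 - eps)


def combine_cdfs(cdf_a, cdf_b):
    """Return the CDF obtained by an inverse-normal combination of two CDFs.

    cdf_a, cdf_b : functions of theta, e.g. the CDF of P^x and the CDF of Q.
    """

    def combined(theta: float) -> float:
        score = (STD_NORMAL.inv_cdf(_clip(cdf_a(theta)))
                 + STD_NORMAL.inv_cdf(_clip(cdf_b(theta))))
        return STD_NORMAL.cdf(score / sqrt(2.0))

    return combined
```

When both inputs are standard normal CDFs, the combined CDF is that of a normal distribution with variance 1/2, as would be expected from pooling two independent estimates of equal precision.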



3.4.3 Confidence levels versus p-values 

Although both confidence levels and p-values can be computed from the same p-value function, the 
following examples illustrate how they can lead to different inferences and decisions. Section 3.4.4 then
demonstrates that the former but not the latter are consistent as estimators of composite hypothesis 
truth. 

Example 17 (point null hypothesis). If $P^x(\vartheta < \bullet)$ is continuous on $\Theta$, then $P^x(\theta = \theta') = 0$ for any
interior point $\theta'$ of $\Theta$. This means that given any alternative hypothesis $\theta \in \Theta'$ such that $P^x(\theta \in \Theta') > 0$,
betting on $\theta = \theta'$ versus $\theta \in \Theta'$ at any finite betting odds will result in expected loss, reflecting
the absence of information singling out the point $\theta = \theta'$ as a viable possibility before the data were
observed. (By contrast, the usual two-sided p-value is numerically equal to $p_x(\theta')$, which does not
necessarily equal the probability of any hypothesis of interest.) If, on the other hand, the parameter
value can equal the null hypothesis value for all practical purposes, that fact may be represented by
modeling the parameter of interest as a random effect with nonzero probability at the null hypothesis
value. The latter option would define the confidence measure such that its CDF is a predictive p-value
function such as that used by Lawless and Fredette (2005).



Example 18 (bioequivalence). Regulatory agencies often need an estimate of $1_{[\theta'-\Delta,\,\theta'+\Delta]}(\theta)$, the
indicator of whether the continuous parameter of interest lies within $\Delta$ of $\theta'$ for
some $\Delta > 0$; a value common in bioequivalence studies is $\Delta = \log(125\%)$ with $\exp(\theta')$ as the efficacy of
a medical treatment. For the purpose of deciding whether to approve a new treatment or a genetically
modified crop, estimates provided by companies with obvious conflicts of interest must be as objective
as possible. The Neyman-Pearson framework in effect enables conservative tests of the null hypotheses
$\theta \in [\theta'-\Delta, \theta'+\Delta]$, $\theta < \theta'-\Delta$, and $\theta > \theta'+\Delta$ (Wellek 2003) but without guidance on how to use
the resulting p-values $p_x(\theta')$, $p_x^{+}(\theta'-\Delta)$, and $p_x^{-}(\theta'+\Delta)$ to make coherent decisions, which would
instead require estimates of $1_{(-\infty,\,\theta'-\Delta)}(\theta)$, $1_{[\theta'-\Delta,\,\theta'+\Delta]}(\theta)$, and $1_{(\theta'+\Delta,\,\infty)}(\theta)$ such that the sum of
the estimates is 1. The probabilities $P^x(\vartheta < \theta'-\Delta)$, $P^x(\theta'-\Delta \le \vartheta \le \theta'+\Delta)$, and $P^x(\vartheta > \theta'+\Delta)$
qualify as such estimates without suffering from the subjective or arbitrary nature of assigning a prior
distribution. Due to the coherence of probabilistic indicator estimators, regulators may simultaneously
consider more complex estimates such as $P^x(\vartheta > \theta'+\Delta \mid \vartheta \notin [\theta'-\Delta, \theta'+\Delta])$, the probability that
the effect size is high given that it is non-negligible, without the multiplicity concerns that plague
Neymanian statistics (§2.5). Singh et al. (2007) also compared the use of observed confidence levels to
conventional methods of bioequivalence.
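
The three region probabilities and the conditional probability of this example can be sketched numerically. The sketch below assumes, purely for illustration, that the confidence measure is normal with a given point estimate and standard error; the function name `region_confidence` and the input values are hypothetical.

```python
from math import log
from statistics import NormalDist


def region_confidence(estimate, std_error, center, delta):
    """Confidence levels of (below, within, above) the equivalence region
    [center - delta, center + delta] under an assumed normal confidence measure."""
    cd = NormalDist(mu=estimate, sigma=std_error)  # distribution of the random parameter
    below = cd.cdf(center - delta)
    above = 1.0 - cd.cdf(center + delta)
    within = 1.0 - below - above
    return below, within, above


delta = log(1.25)  # the margin log(125%) mentioned in the example
below, within, above = region_confidence(
    estimate=0.10, std_error=0.08, center=0.0, delta=delta
)

# Probability that the effect is high given that it is non-negligible.
high_given_outside = above / (below + above)

print(f"below={below:.4f} within={within:.4f} above={above:.4f}")
print(f"P(high | outside)={high_given_outside:.4f}")
```

Because all four numbers come from a single probability measure, the three region estimates sum to 1 and the conditional estimate follows from them by the ordinary rules of probability.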



3.4.4 Consistency of hypothesis confidence 

More terminology will be introduced to establish a sense in which the confidence value but not the 
p-value consistently estimates the hypothesis indicator. 

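Definition 19. An indicator estimator $\hat{1}$ is consistent if, for all $\Theta' \in \mathcal{A}$,

$$\hat{1}_{\Theta'}(X) \xrightarrow{P_{\theta,\gamma}} 1_{\Theta'}(\theta)$$

for every $\gamma \in \Gamma$ and for every $\theta$ that is an element of $\Theta$ but not of the boundary of $\Theta'$.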

By the usual concept of statistical power, the Type II error rate of $p^{\pm}$ associated with testing the
false null hypothesis that $\theta = \theta'$ at significance level $\alpha$ is $\beta^{\pm}(\alpha, \theta, \theta') = P_{\theta,\gamma}(p^{\pm}_X(\theta') > \alpha)$ for any
$\theta \neq \theta'$. Commonly used in two-sided testing, the two-sided p-value of the null hypothesis that $\theta \in \Theta'$
is written $p_x(\Theta')$ for all $\Theta' \subseteq \Theta$ and $x \in \Omega$.

The next two propositions contrast the consistency of the confidence value with the inconsistency 
of the two-sided p-value. 

Proposition 20. Assume all one-sided tests represented by the p-value functions $p^{\pm}$ are asymptotically
powerful in the sense that $\lim_{n\to\infty} \beta^{\pm}(\alpha, \theta, \theta') = 0$ for all $\alpha \in (0, 1)$ and for all $\theta, \theta' \in \Theta$ such that
$\theta \neq \theta'$. The function $\hat{1} : \mathcal{A} \times \Omega \to [0, 1]$ is a consistent indicator estimator if $P^x = \hat{1}_{\bullet}(x)$ is a confidence
measure corresponding to $p^{\pm}$ given $X = x$ for all $x \in \Omega$.

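Proof. By the definition of the boundary of a set $\Theta'$ as the difference between its closure $\overline{\Theta'}$ and its
interior $\operatorname{int} \Theta'$, the proposition covers two cases for all $\Theta' \in \mathcal{A}$: either $\theta$ is in $\operatorname{int} \Theta'$, in which case it
asserts $P^x(\Theta') \xrightarrow{P_{\theta,\gamma}} 1$, or $\theta$ is in $\Theta \setminus \overline{\Theta'}$, in which case it asserts $P^x(\Theta') \xrightarrow{P_{\theta,\gamma}} 0$. Let $\mathcal{A}'$
represent the set of all disjoint open intervals $\Theta'''$ whose union is $\operatorname{int} \Theta'$, so that $P^x(\Theta')$ is the sum of
$P^x(\Theta''')$ over $\Theta''' \in \mathcal{A}'$. Each term of the sum expands as

$$
\begin{aligned}
P^x(\Theta''') = P^x\big((\inf \Theta''', \sup \Theta''')\big) &= p_x^{+}(\sup \Theta''') - p_x^{+}(\inf \Theta''') \\
&= p_x^{-}(\inf \Theta''') - p_x^{-}(\sup \Theta''') \\
&= 1 - p_x^{-}(\sup \Theta''') - p_x^{+}(\inf \Theta''').
\end{aligned}
$$

As the p-value functions are asymptotically powerful, $p^{\pm}_X(\theta')$ satisfies
$\lim_{n\to\infty} P_{\theta,\gamma}(p^{\pm}_X(\theta') > \alpha) = 0$ for all $\alpha \in (0,1)$ and for all
$\theta, \theta' \in \Theta$ such that $\theta \neq \theta'$, with the result that each term may be written as a function of p-values
that converge in $P_{\theta,\gamma}$ to 0:

$$
P^x(\Theta''') =
\begin{cases}
p_x^{-}(\inf \Theta''') - p_x^{-}(\sup \Theta''') & \theta < \inf \Theta''' \\
1 - p_x^{-}(\sup \Theta''') - p_x^{+}(\inf \Theta''') & \theta \in \Theta''' \\
p_x^{+}(\sup \Theta''') - p_x^{+}(\inf \Theta''') & \theta > \sup \Theta'''
\end{cases}
\;\xrightarrow{P_{\theta,\gamma}}\;
\begin{cases}
0 - 0 & \theta < \inf \Theta''' \\
1 - 0 - 0 & \theta \in \Theta''' \\
0 - 0 & \theta > \sup \Theta'''
\end{cases}
$$

for all $\Theta''' \in \mathcal{A}'$. Summing the terms over $\mathcal{A}'$ yields

$$
P^x(\Theta') \xrightarrow{P_{\theta,\gamma}} \sum_{\Theta''' \in \mathcal{A}'} 1_{\Theta'''}(\theta) = 1_{\Theta'}(\theta)
$$

since $\theta \in \operatorname{int} \Theta'$ implies that $\theta$ is in one element of $\mathcal{A}'$.

□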



Remark 21. Polansky (2007, pp. 37-38) proved a similar proposition of consistency given a smooth
distribution $P_{\theta,\gamma}$. A suitably transformed likelihood ratio test statistic is also a consistent indicator
estimator under the standard regularity conditions (Bickel 2008).



Proposition 22. Under the conditions of Proposition 20, the two-sided p-value $p_x(\Theta')$ is not a consistent
indicator estimator.

Proof. For any $\theta \in \Theta' \in \mathcal{A}$, the distribution of the two-sided p-value $p_x(\Theta')$ converges to the uniform
distribution on [0,1] (Singh et al. 2007), violating consistency (Definition 19). □
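
The contrast between the two propositions can be illustrated with a small deterministic calculation. The sketch assumes a normal mean with known standard deviation, a normal confidence measure centered at the sample mean, and the usual two-sided p-value of a point null; these modeling choices and the function names are illustrative assumptions, not the general setting of the propositions.

```python
from math import sqrt
from statistics import NormalDist

PHI = NormalDist().cdf  # standard normal CDF


def interval_confidence(xbar, n, lower, upper, sigma=1.0):
    """Confidence level of the hypothesis lower < theta < upper under the
    normal confidence measure centered at xbar with standard error sigma/sqrt(n)."""
    se = sigma / sqrt(n)
    return PHI((upper - xbar) / se) - PHI((lower - xbar) / se)


def two_sided_p_value(xbar, n, null_value, sigma=1.0):
    """Usual two-sided p-value of the point null theta = null_value."""
    z = (xbar - null_value) * sqrt(n) / sigma
    return 2.0 * min(PHI(z), 1.0 - PHI(z))


theta, z_noise = 0.0, 0.8  # true value and a fixed standardized sampling error
for n in (10, 100, 10_000):
    xbar = theta + z_noise / sqrt(n)  # sample mean implied by that error
    inside = interval_confidence(xbar, n, lower=-0.5, upper=0.5)  # true hypothesis
    outside = interval_confidence(xbar, n, lower=0.5, upper=1.5)  # false hypothesis
    p_value = two_sided_p_value(xbar, n, null_value=theta)        # true point null
    print(n, round(inside, 4), round(outside, 4), round(p_value, 4))
```

For a fixed standardized sampling error, the confidence levels of the true and false interval hypotheses move toward 1 and 0 as n grows, whereas the two-sided p-value of the true point null does not depend on n at all, so its sampling distribution cannot concentrate at 1.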



4 Discussion 

The confidence metameasure $V^x$ and the confidence measure or frequentist posterior $P^x$ bring both
coherence and consistency to frequentist inference and decision making.

The coherence property established in Section 2.3 confers the ability to consistently and directly
report the levels of confidence of as many complex hypotheses as desired and to perform estimation
and prediction (§3.2). Even though the frequentist posterior $P^x$ is a flexible distribution of possible
values of a fixed parameter, it requires no prior; in fact, $P^x$ need not even necessarily correspond to any
Bayesian posterior distribution (§3.3). In conclusion, the metalevel or level of confidence in a given
hypothesis has the internal coherence of the Bayesian posterior or class of such posteriors without
requiring a prior distribution or even an exact confidence set estimator.

More can be said if the parameter of interest is one-dimensional, in which case the confidence level
of a composite hypothesis is consistent as an estimate of whether that hypothesis is true, whereas
neither the Bayesian posterior probability nor the p-value is generally consistent in that sense (§3.4.4).
Specifically, the equality of the confidence level of $\theta \in \Theta'$ to the coverage rate of the corresponding
confidence set guarantees convergence in probability to 1 if $\theta$ is in the interior of $\Theta'$ or to 0 if $\theta \notin \Theta'$
(Proposition 20).